home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Almathera Ten Pack 3: CDPD 3
/
Almathera Ten on Ten - Disc 3: CDPD3.iso
/
scope
/
201-220
/
scopedisk208
/
ibem
/
up100.lzh
/
Update_techie.doc
< prev
next >
Wrap
Text File
|
1991-03-21
|
4KB
|
70 lines
Techie info, or "How Does it Work?"
-----------------------------------
Don't read this unless you actually care how this program works.
The method is reminiscent of LZSS encoding. I don't know exactly what the
SS stands for, but this isn't LZSS anyway so I'm just going to call it LZF
(yes, the 'F' stands for Fehdrau; I can't afford vanity plates for my car
so this will have to do). Somebody tell me if I'm off-base using L & Z's
initials in the name. (Lempel & Ziv, I think their names were. Apologies
to them if my memory fails me.)
There is a control byte for the next 8 elements, which can either be new
bytes (corresponding bit set to 0) or three-byte references to strings in
the old program (corresponding bit set to 1). If a string matches only
one or two characters in the old file, the individual bytes are stored,
since that way it is either 9 (control bit + byte) or 18 (2 x (control_bit
+ byte)) bits instead of 25 (control bit plus three byte reference).
Three or more bytes are stored as a reference, since 25 bits is less
than 27+ (3+ x (control bit + byte)).
The references are to a 4096 byte window into the old file. They are
stored as 12 bits (hi->lo) of location and 12 bits (hi->lo) of length. A
reference of 4095,4095 is legal, since access outside of the window is
allowed, so long as it STARTS in the window. String accesses outside of
file bounds are not allowed and are cut off if they are found.
The 4096 byte window pointer starts out at the start of the old file. The
window itself is bounded by pointer-1024 and pointer+3071. However, the
bounds are modified (never the pointer, just the bounds) so that they are
never outside of the file bounds, except with files smaller than 4096 bytes
in which the upper bound is out of the file. The pointer increments by a
value equal to the length of the longest match if it is between 1 and 32
bytes. Matches longer than 32 bytes (arbitrary limit, seems to work best)
move the pointer to the end of that match in the old file. Using this
method the dearchiving process can also figure out which 4096 bytes of the
old file are in the window.
"Seems kinda slow." Yeah. Unlike LZSS and LZH, I can't use a binary
search tree to find the longest match because then I'd have to rebuild the
tree after every change in the window position. Instead, I had to
literally search the entire damn buffer. I got a little clever about it,
though, so that if it, say, had a five-byte match already, it wouldn't
check the first five bytes of any subsequent string unless the sixth byte
matched the given string. That's the best I could think of. Better
ideas, anyone? Anyway, using that method it's only slow on different
files. A perfect match is really very fast, and close matches are quite
respectable. If it goes too slow, the .up file is probably not going to
be very small anyway.
How well does it work? Quite, usually, for me. It's really meant as a
patch program, not a full upgrade deal. The problem with executables is
that 9/10 branches are relative, and therefore are changed if anything is
added or deleted between the branch and its destination, so the relative
offset changes and causes a string break. Similarly, if your compiler
uses a register to access a global data area (ie. Manx's use of A4), and
you add one single variable at the start of the data area, every
reference to a global variable will be different. Too much of that and
Update's not going to find many long strings. It's good for, and meant
for, fixing that nasty division by zero error, unallocated string,
version number, etc, but not changing 500 instances of printf() to
fprintf(stderr,). Get it?
By the way, if nothing ever matches, the resultant .up file will be 12.5%
larger than the original (9 bits per new byte).
For the future: Maybe add Huffman to the code. The control bytes, new
bytes, and buffer locations should be more or less random. However,
lengths will likely cluster. But that's another story.